Preface


This paper presents an investigation into the string
pattern matching process and, in particular, the realization
of the process through the use of various data structures and
algorithms on the CDC CYBER 74 computer. The presentation
presumes that the reader is familiar with standard computer
terminology and has a basic understanding of computer
operation. A knowledge of finite state machines would be
helpful during the discussion of the alternate/successor linked
list and finite state automata data structures but is not a
prerequisite to understanding them.

As a result of this work, much basic insight into the
pattern matching process has been developed and some
interesting conclusions have been reached. The work done in
evaluating the various implementations was time consuming
but enjoyable, and it forced a methodical approach to
constructing and comparing the many test programs. This
approach is something that only experience can teach, and I
will be able to apply it in future work. To this end I wish to
acknowledge the guidance and interest of my thesis advisor.
Abstract

The discrete-pattern matching process is described and its
key elements are identified. Six data structure approaches
and their related search algorithms are presented. These
include a simple list, an indexed simple list, a linked list,
a binary tree, an alternate/successor linked list, and a
finite state automata.

Twelve programs were coded to implement five of the six
data structures/algorithms (the linked list was not used)
using packed and unpacked approaches on a CYBER 74. Runs were
made with twelve different data files using Fortran, English,
and random text. The effects of the number of patterns in the
data structure and of the expected incidence in the text were
included. Results indicate that a "best" data structure/
algorithm may be chosen from three implementations: an
unpacked finite state automata approach, an unpacked
alternate/successor linked list approach, and a packed version
of the latter.


DISCRETE-PATTERN  MATCHING  ALGORITHMS  AND 
DATA  STRUCTURES  FOR  CYBER  74 

I. Introduction

This thesis will investigate several different algorithms
and data structures used to implement the pattern matching
process with discrete-patterns. These algorithms and data
structures were implemented on a Control Data Corporation
(CDC) CYBER 74 computer.

A "discrete-pattern" is, in the context of this paper,
a finite string of characters. For example, a single word,
this line of text, or, for that matter, this entire page may
be considered a discrete-pattern since each is a finite
string of characters. For the remainder of this paper,
the term "pattern" will be used in place of
"discrete-pattern".

The  Pattern  Matching  Process 

The concept of the pattern matching process is illustrated
in figure 1. The two key elements which will be studied in
this paper are the matching algorithm and the data structure.
Also of some interest will be the type of input string and
what influence it may have on the overall process.

Briefly, the pattern matching process is performed by
the matching (or search) algorithm. This algorithm "searches"
the input subject string for occurrences of patterns which
are stored in the data structure. The results of the search
are output as the end product of the match process.

Figure 1. The Pattern Matching Process

The exact organization of the data structure can vary
from one application to another, and the match algorithm
varies with it. This thesis will attempt to identify
efficient combinations of data structure and matching
algorithm for use on the CYBER 74.

Pattern  Matching  Applications 

Pattern matching is an integral concept in many different
computer applications programs. In fact, for some uses the
matching algorithm is the speed-limiting, and hence
efficiency-determining, portion of the application. For
example, a language compiler during its lexical analysis is
essentially a pattern matching algorithm; it searches an input
subject string (source program) for occurrences of specific
patterns (such as keywords and variable names), and outputs an
analyzed program for subsequent use in the remaining phases.
The pattern matching process is applied in the latter phases
also, but to a lesser degree. In any event, the speed of the
language compiler is closely related to the speed of its
pattern matching portion.

Macro-processors (or preprocessors, as they are often
called) also rely on the pattern matching process. A
macro-processor allows a user to specify his own syntax for a
standard language such as Fortran. That is, the user
"tells" the macro-processor what syntax constructs to look
for (patterns) and what is to be done as the result of a
pattern match. As a very simple example, suppose the user
would like the occurrence of the string "QUIT" in the source
coding to mean halt processing and close all files. He would
therefore identify to the macro-processor the pattern "QUIT"
and the coding necessary to perform the stop and close files
operations. (This is done within the language of the
macro-processor.) Thus, when a source program is submitted to
the macro-processor, it is searched for occurrences of the
pattern "QUIT", and at each such occurrence the associated
coding replaces the string "QUIT". This updated (preprocessed)
code is then sent to the regular language compiler, which
generates the final object code.

However, the previous example has perhaps oversimplified
the sometimes complex problem of pattern definition and
matching. For instance, what if the user would like a pattern
"IF#THEN" to match all occurrences of an "IF" followed by a
"THEN", where "#" represents an arbitrary number of
intervening characters? How is this pattern stored in the
data structure? And how is the match algorithm to determine a
success or failure?

Identical situations exist in the string and list
manipulating languages like SNOBOL and LISP. Much work has
been done in the study of such patterns and their
representations (Ref 6). Even faster versions of SNOBOL (such
as SPITBOL) have been developed, all of which owe much of
their speedier operation to the design of the pattern data
structure and associated matching algorithm.

Another  user  of  the  pattern  matching  process  is  the 
familiar  text  editor.  Its  use  of  the  matching  process  is 
obvious.  For  example,  when  using  the  text  editor,  one  is 
often  scanning  text  (e.g.  source  code  for  a program)  for 
occurrence  of  a string  of  characters  and  then  replacing  this 
string  with  another  or  even  deleting  it. 

Bibliographic search is an amplified text editor scanning
problem. In this case, a large amount of input text,
perhaps an entire book, is searched for occurrences of
keywords (stored in a data structure) in order to identify
references to these keywords.

Thus,  the  pattern  matching  process  can  be  seen  as  an 
essential  part  of  several  computer  applications  programs: 

1. Compilers
2. Macro-processors
3. String and list manipulating languages
4. Text editors

But there are other, perhaps not so obvious, uses.
These are within the operating system of the computer. For
example, core allocation can use pattern matching to match
a requested amount of memory (input subject string) against
available blocks (patterns) in a list of available memory
(data structure). As another example, file management
algorithms can locate files by matching the queried name
(subject string) against the names of current files (patterns)
in the list of file names (data structure).

So,  it  is  evident  that  the  pattern  matching  process  is 
an  important  function  of  computer  operations,  both  from  the 
user's  viewpoint  (applications  programs)  and  from  the  analyst's 
viewpoint  (operating  system). 

Thesis  Objective 

The purpose of this thesis is to investigate the
performance of various data structures and associated pattern
matching algorithms within the hardware constraints of the CDC
CYBER 74 computer. Specifically, the investigation has been
limited to pattern matching as described for bibliographic
search and text editing. That is, the discrete-pattern is
simply a string of characters, and therefore the effect of
substitution characters (such as the "#" mentioned before)
and other advanced concepts will not be evaluated. Such
topics cannot adequately be studied until the simpler cases
are well understood. Thus, this thesis will provide a firm
foundation for future studies into the implementation of such
complex pattern representations on the CYBER 74. (See the
recommendations chapter.)
Thesis Approach

Therefore,  with  this  focus  established,  the  thesis 
must  first  identify  the  various  data  structures  available 
for  use  in  the  pattern  matching  process  and  the  effect  these 
structures  have  on  the  matching  algorithm.  Then,  these  data 
structures  and  algorithms  must  be  implemented  on  the  CYBER  74 
and  their  relative  performances  evaluated.  Key  features  of 
the  CYBER  74  hardware  will  be  identified  and  used  during  this 
evaluation,  hopefully  providing  insight  into  more  complex 
future  implementations. 

Data  Structures  and  Associated  Algorithms 

There  are  many  different  data  structures  which  may  be 
used  in  the  pattern  matching  process.  With  each  unique  data 
structure  there  is  a corresponding  match  algorithm  which 
interfaces  with  the  structure.  Some  data  structure  designs 
will  store  one  character  per  word  of  computer  storage;  others 
will  store  multiple  characters.  Each  requires  different 
handling  by  the  match  algorithm. 

Simple List. Perhaps the least complicated data structure
to work with is shown in figure 2. In this example,
four patterns are stored in the data structure. Notice that
each character of each pattern occupies a single word of
storage. Notice also the other entries in the data
structure. These integers are the values of pattern length
(LEi) and pattern number (PNi) associated with each pattern,
i, in the structure. This structure is a simple list.


    Pattern #   Pattern
        1       HE
        2       SHE
        3       HAT
        4       THEY

    Location #   Data Structure
         1       2   (LE1)
         2       H
         3       E
         4       1   (PN1)
         5       3   (LE2)
         6       S
         7       H
         8       E
         9       2   (PN2)
        10       3   (LE3)
        11       H
        12       A
        13       T
        14       3   (PN3)
        15       4   (LE4)
        16       T
        17       H
        18       E
        19       Y
        20       4   (PN4)

    LE - Length in characters
    PN - Pattern number

Figure 2. Simple List Data Structure

The  pattern  number  represents  that  piece  of  information 
which  is  uniquely  associated  with  each  pattern.  It  could 
easily  be  a pointer  to  an  associated  routine  or  other 
function.  The  pattern  length,  however,  may  be  considered 
an  unnecessary  but  nice  to  have  piece  of  data.  Its  value 
is  included  since  it  is  known  at  the  time  the  pattern  is 
stored  and  therefore  requires  no  overhead  in  calculation. 

Its  convenience  lies  in  its  use  as  a "stepping"  value  so  that 
one  may  search  the  structure  by  adding  lengths.  (This  is 
opposed  to  a search  made  by  examining  each  storage  word  for 
occurrence  of  a pattern  number  (stored  in  distinctive  form 
such  as  a negative  value)  which  would  then  indicate  the 
beginning  or  end  of  a pattern.) 

These  two  pieces  of  information,  pattern  number  and 
pattern  length,  will  be  associated  with  each  pattern  in  all 
of  the  following  examples. 
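
The layout of figure 2 can be sketched in a few lines of Python. This is only an illustration, not the thesis's CYBER 74 code: the function names are hypothetical, a Python list element stands in for one word of storage, and locations are counted from 0 rather than from 1 as in the figure.

```python
def build_simple_list(patterns):
    """Flatten patterns into one list: LE, the characters, then PN."""
    store = []
    for number, pattern in enumerate(patterns, start=1):
        store.append(len(pattern))   # LE: pattern length
        store.extend(pattern)        # one character per storage word
        store.append(number)         # PN: pattern number
    return store


def pattern_starts(store):
    """Step through the list by adding lengths; yield each LE location."""
    pos = 0
    while pos < len(store):
        yield pos
        pos += store[pos] + 2        # skip the LE word, the characters, the PN word


store = build_simple_list(["HE", "SHE", "HAT", "THEY"])
print(len(store))                    # 20 words, as in figure 2
print(list(pattern_starts(store)))   # [0, 4, 9, 14]
```

The second function shows the "stepping" use of the length value: no storage word has to be inspected to find where the next pattern begins.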

The pattern matching algorithm associated with this
simple list is equally simple, though not terribly
efficient. The algorithm begins its search with the
first character of an input subject string and does not move
to the next character until every pattern in the data
structure has been compared against the subject string
beginning at that character. This is more easily shown than
said, as is illustrated in figure 3.
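
A minimal sketch of this exhaustive search follows, assuming the flat LE/characters/PN layout of figure 2 held in a Python list. The function name and the 0-based positions are assumptions made for the illustration; the thesis programs themselves are not reproduced here.

```python
def search_simple_list(store, text):
    """Try every stored pattern at every position of the subject string."""
    matches = []
    for start in range(len(text)):
        pos = 0
        while pos < len(store):
            length = store[pos]
            chars = store[pos + 1 : pos + 1 + length]
            number = store[pos + 1 + length]
            if list(text[start : start + length]) == chars:
                matches.append((start, number))   # (position, pattern number)
            pos += length + 2                     # step to the next pattern
    return matches


# The data structure of figure 2, written out literally.
store = [2, "H", "E", 1, 3, "S", "H", "E", 2,
         3, "H", "A", "T", 3, 4, "T", "H", "E", "Y", 4]
print(search_simple_list(store, "SHEP"))          # [(0, 2), (1, 1)]
```

For the subject string "SHEP" the sketch reports pattern 2 ("SHE") at the first character and pattern 1 ("HE") at the second, which are the two matches the walk-through of figure 3 finds.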

    Subject string: S H E P

    Step  Algorithm                        Char   Result
      1   Begins search with S
      2   Tries                             H     no match
      3   Tries                             S     match
      4   Continues                         H     match
      5   Continues                         E     match, PN2
      6   Tries                             H     no match
      7   Tries                             T     no match
      8   Reaches end of data structure
      9   Moves to next input char (H)
     10   Tries                             H     match
     11   Continues                         E     match, PN1
     12   Tries                             S     no match
     13   Tries                             H     match
     14   Continues                         A     no match
     15   Tries                             T     no match
     16   Reaches end of data structure
     17   Moves to next input char (E)
     18   Tries                             H     no match
     19   Tries                             S     no match
     20   Tries                             H     no match
     21   Tries                             T     no match
     22   Reaches end of data structure
     23   Moves to next input char (P)
     24   Tries                             H     no match
     25   Tries                             S     no match
     26   Tries                             H     no match
     27   Tries                             T     no match
     28   Reaches end of data structure
     29   Moves to next input char

Figure 3. Search with Simple List Data Structure

Indexed Simple List. A modification to the simple list
is to incorporate an index to point to (hash into) the
location of the first occurrence of a pattern based on some
index value. A possible choice for index value might be to use
first character lexical values. Figure 4 shows such a data
structure and associated index. Note in the figure that the
second pattern beginning with "H" is not indexed.

A match algorithm using this structure will, one would
think, require fewer comparisons than the algorithm using a
simple list. This assumption is borne out when figure 5 is
compared to figure 3. Figure 5 represents the search steps
using an indexed simple list for the same input subject
string as figure 3. Notice that in the example the total
number of steps has decreased from twenty-nine to twenty.
This is a result of using an index to enter the data
structure. Specifically, in figure 5, the construct "INDEX( )"
refers to the search algorithm checking the index value of a
particular character. For example, in step 2, the index
value of S, INDEX(S), is equal to 5 as shown in figure 4.
(This use of INDEX( ) will be consistent for the remainder
of this paper.)

However, though the number of steps in the search has
decreased, data structure size has been increased by the
addition of the index table. It should also be noted that
in the worst case, when all patterns begin with the same
character, this structure will be no better than a simple
list. In such a case, a better choice of indexing value
might be to hash into the list based on the first two or
three characters. In any event, the choice of a hash
function is often very data dependent, and extensive studies
into optimum selections have been documented by others, so this
paper will not become involved in this area (Ref 10: 315-358).

Figure 4. Indexed Simple List Data Structure.

    Subject string: S H E P

    Step  Algorithm                        Char   Result
      1   Begins search with S
      2   INDEX(S)=5, Tries                 S     match
      3   Continues                         H     match
      4   Continues                         E     match, PN2
      5   Tries                             H     no match
      6   Tries                             T     no match
      7   Reaches end of data structure
      8   Moves to next input char (H)
      9   INDEX(H)=1, Tries                 H     match
     10   Continues                         E     match, PN1
     11   Tries                             S     no match
     12   Tries                             H     match
     13   Continues                         A     no match
     14   Tries                             T     no match
     15   Reaches end of data structure
     16   Moves to next input char (E)
     17   INDEX(E)=0, Fails on index
     18   Moves to next input char (P)
     19   INDEX(P)=0, Fails on index
     20   Moves to next input char

Figure 5. Search with Indexed Simple List
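
The indexed simple list can be sketched as follows. The sketch assumes a first-character index kept in a Python dictionary (a stand-in for the index table of figure 4) and 0-based locations, so INDEX(S)=5 and INDEX(H)=1 in the figures correspond to 4 and 0 here. All names are hypothetical.

```python
def build_index(store):
    """Map each first character to the location of its first pattern."""
    index = {}
    pos = 0
    while pos < len(store):
        index.setdefault(store[pos + 1], pos)   # later patterns are not indexed
        pos += store[pos] + 2
    return index


def search_indexed(store, index, text):
    """Enter the list through the index, then scan to the end of the list."""
    matches = []
    for start in range(len(text)):
        pos = index.get(text[start])
        if pos is None:
            continue                            # "fails on index"
        while pos < len(store):
            length = store[pos]
            if list(text[start : start + length]) == store[pos + 1 : pos + 1 + length]:
                matches.append((start, store[pos + 1 + length]))
            pos += length + 2
    return matches


store = [2, "H", "E", 1, 3, "S", "H", "E", 2,
         3, "H", "A", "T", 3, 4, "T", "H", "E", "Y", 4]
index = build_index(store)
print(index)                                    # {'H': 0, 'S': 4, 'T': 14}
print(search_indexed(store, index, "SHEP"))     # [(0, 2), (1, 1)]
```

Characters such as "E" and "P" that begin no pattern are rejected by a single index lookup, which is where the saving over the simple list comes from.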


Indexed Linked List. As the next logical "improvement"
to the indexed list, one might choose to link patterns of the
same index value. Figure 6 shows how this data structure
would appear; with each pattern, i, there is an associated
link (LIi).

The search using this data structure is improved, since
only patterns with the same index value will be compared.
Figure 7 shows this improvement when compared to the last
example in figure 5 (seventeen steps rather than twenty).

Again, as in the indexed simple list, if all patterns
were to have the same index value, then the search would be
no better than a simple list. On the other hand, given a
larger and more dispersed set of patterns, the linked list
approach would show an even more significant improvement.
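
A hedged sketch of the indexed linked list idea is given below. For brevity each pattern is held as one Python record of the form [pattern, number, link] rather than as individual storage words, and None plays the role of a zero link; both choices are assumptions of the illustration and not the layout of figure 6.

```python
def build_linked(patterns):
    """Return (records, index); each link names the next record with the same index value."""
    records, index, last = [], {}, {}
    for number, pattern in enumerate(patterns, start=1):
        pos = len(records)
        records.append([pattern, number, None])
        key = pattern[0]                  # index value: first character
        if key in last:
            records[last[key]][2] = pos   # chain from the previous pattern with this key
        else:
            index[key] = pos              # first pattern with this key is indexed
        last[key] = pos
    return records, index


def search_linked(records, index, text):
    """Compare only the patterns chained under the current character."""
    matches = []
    for start in range(len(text)):
        pos = index.get(text[start])
        while pos is not None:
            pattern, number, link = records[pos]
            if text.startswith(pattern, start):
                matches.append((start, number))
            pos = link                    # follow the link, or stop on a null link
    return matches


records, index = build_linked(["HE", "SHE", "HAT", "THEY"])
print(search_linked(records, index, "SHEP"))   # [(0, 2), (1, 1)]
```

Here "HE" is linked to "HAT", so a subject character "H" causes exactly those two patterns to be tried and no others.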

    Subject string: S H E P

    Step  Algorithm                        Char   Result
      1   Begins search with S
      2   INDEX(S)=6, Tries                 S     match
      3   Continues                         H     match
      4   Continues                         E     match, PN2
      5   LI2=0, Fails on link
      6   Moves to next input char (H)
      7   INDEX(H)=1, Tries                 H     match
      8   Continues                         E     match, PN1
      9   LI1=13, Follow link
     10   Tries                             H     match
     11   Continues                         A     no match
     12   LI3=0, Fails on link
     13   Moves to next input char (E)
     14   INDEX(E)=0, Fails on index
     15   Moves to next input char (P)
     16   INDEX(P)=0, Fails on index
     17   Moves to next input char                DONE

Figure 7. Search with Indexed Linked List.

Indexed Binary Tree. With the choice of a better index
(hash) value, patterns beginning with the same character
could be found uniquely, or with few sharing the same index
value. However, with this approach the index table could
grow disproportionately large when compared to the pattern
storage area, or the hashing function could become overly
complex. An alternative improvement to the indexed linked
list is an indexed sorted linked list, of which the binary
tree is a possible choice.

In the binary tree, there are two links associated with
each pattern, a high link and a low link. As each new
pattern is added to the data structure, its index value is
first calculated. If no pattern exists with that index value,
then the new pattern is entered immediately into the data
structure. However, if a pattern already exists with that
index value, then the new pattern and existing pattern are
compared to see if the new pattern should be attached to the
high or low link of the old pattern. This comparison may be
based on second characters, or some combination of characters.
As in all indexed files, any attempt to optimize this hash
function can be very data dependent.

Figure  8 shows  a binary  tree  data  structure  for  the  same 
patterns  as  in  previous  examples.  Because  only  two  patterns 
share  the  same  index  value,  "HE"  and  "HAT",  the  true  worth 
of  the  binary  tree  is  not  clearly  demonstrated.  Its  value 
will  become  more  evident  in  later  work.  Note  the  additional 
link  for  each  pattern  for  a total  of  two  per  pattern,  a low 
link  (LLi)  and  a high  link  (HLi). 

A search using an indexed binary tree will require a
little extra work to determine which link to follow. This
overhead, on the average, should not cause the matching
algorithm to perform less efficiently and, as can be seen
in the example in figure 9, the search required one less step
than with the indexed linked list. These comparative results
are hardly conclusive, but on the average they can be
expected.
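
The following sketch shows one way the high and low links might be used. It assumes the comparison is made on second characters, that an equal comparison follows the high link, and that every pattern has at least two characters; the class and function names are hypothetical and the node layout is not that of figure 8.

```python
class Node:
    def __init__(self, pattern, number):
        self.pattern, self.number = pattern, number
        self.low = self.high = None       # LL and HL of the text


def insert(index, pattern, number):
    """Add a pattern under its index value, ordered by second character."""
    node = index.get(pattern[0])
    if node is None:
        index[pattern[0]] = Node(pattern, number)
        return
    while True:
        side = "low" if pattern[1] < node.pattern[1] else "high"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(pattern, number))
            return
        node = child


def search_tree(index, text):
    """At each position, walk one path of the tree chosen by the second character."""
    matches = []
    for start in range(len(text)):
        node = index.get(text[start])
        second = text[start + 1 : start + 2]   # empty at the end of the text
        while node is not None:
            if text.startswith(node.pattern, start):
                matches.append((start, node.number))
            node = node.low if second < node.pattern[1] else node.high
    return matches


index = {}
for number, pattern in enumerate(["HE", "SHE", "HAT", "THEY"], start=1):
    insert(index, pattern, number)
print(search_tree(index, "SHEP"))   # [(0, 2), (1, 1)]
print(search_tree(index, "THAT"))   # [(1, 3)]
```

With these four patterns "HAT" hangs from the low link of "HE", so the subject string "THAT" reaches it after a single second-character comparison.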


    Subject string: S H E P

    Step  Algorithm                        Char   Result
      1   Begins search with S
      2   INDEX(S)=7, Tries                 S     match
      3   Continues                         H     match
      4   Continues                         E     match, PN2
      5   Compares second characters              check high link
      6   HL2=0, Fails on link
      7   Moves to next input char (H)
      8   INDEX(H)=1, Tries                 H     match
      9   Continues                         E     match, PN1
     10   Compares second characters              check high link
     11   HL1=0, Fails on link
     12   Moves to next input char (E)
     13   INDEX(E)=0, Fails on index
     14   Moves to next input char (P)
     15   INDEX(P)=0, Fails on index
     16   Moves to next input char                DONE

Figure 9. Search with Binary Tree
Indexed Alternate/Successor Linked List. In the
preceding data structures each pattern has been stored in its
entirety; there has been no way for similar patterns to share
storage of duplicate characters. For example, if the patterns
"HE" and "HAT" were stored, a minimum of five words was
required. But if the "H" of the two words could share the
same storage location, then the characters of the two
patterns could be stored in four words. This is the concept
behind the alternate/successor linked list. The study of
this data structure was motivated by the similar structure
used within SNOBOL (Ref 6).

However, in building such a structure there is considerable
overhead involved in maintaining character relationships;
e.g., in the last example, the "H" belongs to two patterns,
"HE" and "HAT". This overhead is diminished as the proportion
of duplicate leading characters is increased.

Specifically,  associated  overhead  can  be  seen  in 
figure  10.  With  each  character  there  is  an  alternate  and 
successor  field  to  identify  the  character  relationships. 

There  is  also  the  necessity  to  include  a pattern  number 
field  for  each  character  since  at  any  character  there  may  be 
a possible  pattern  number.  Therefore,  the  size  of  the  data 
structure  has  increased  quite  a bit  when  compared  to  the 
previous  examples. 

However, with the increase in size, the search algorithm
for the pattern matching process has become less complicated.
Figure 11 shows how the algorithm proceeds. When compared
to the binary tree search, there are fewer steps in the
processing of the same input subject string.
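
A minimal sketch of an alternate/successor structure is given below, under several assumptions: the four fields are held in parallel Python lists, location 0 is left unused so that 0 can mean "no link", and the index is a dictionary keyed on first character. The names are hypothetical and the field layout of figure 10 is not reproduced.

```python
def build_alt_succ(patterns):
    """Build character, alternate, successor and pattern-number fields plus an index."""
    char, alt, succ, pn = [None], [0], [0], [0]   # location 0 unused: 0 = no link
    index = {}

    def new_node(c):
        char.append(c); alt.append(0); succ.append(0); pn.append(0)
        return len(char) - 1

    for number, pattern in enumerate(patterns, start=1):
        node = index.get(pattern[0])
        if node is None:
            node = index[pattern[0]] = new_node(pattern[0])
        for c in pattern[1:]:
            child = succ[node]
            if child == 0:
                child = succ[node] = new_node(c)   # first successor of this node
            else:
                while char[child] != c and alt[child] != 0:
                    child = alt[child]             # scan the alternates
                if char[child] != c:
                    alt[child] = new_node(c)       # add a new alternate
                    child = alt[child]
            node = child
        pn[node] = number                          # a pattern ends at this character
    return char, alt, succ, pn, index


def search_alt_succ(structure, text):
    char, alt, succ, pn, index = structure
    matches = []
    for start in range(len(text)):
        node, pos = index.get(text[start], 0), start
        while node != 0:
            if pn[node]:
                matches.append((start, pn[node]))
            pos += 1
            if pos >= len(text):
                break
            node = succ[node]                      # follow the successor link
            while node != 0 and char[node] != text[pos]:
                node = alt[node]                   # then try its alternates
    return matches


structure = build_alt_succ(["HE", "SHE", "HAT", "THEY"])
print(search_alt_succ(structure, "SHEP"))          # [(0, 2), (1, 1)]
```

Built in this order, the sketch places "S" at location 3 with successors at 4 and 5, and "H" at location 1 with successor 2, which agrees with the INDEX and SUCCESSOR values quoted in figure 11. The eleven character nodes also show the sharing described above: "HE" and "HAT" occupy four nodes, not five.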

Therefore,  of  the  five  structures  thus  far  discussed, 
the  alternate/successor  approach  appears  to  allow  the  most 
rapid  pattern  matching  search.  But,  is  there  room  for 
improvement?  The  answer  is  yes. 

Finite State Automata Linked List. Refer to figure 11.
In steps 3 and 4, an "H" and an "E" are matched to find the
pattern "SHE" in "SHEP". Then in steps 7 and 8, an "H" and
an "E" are again matched to find "HE" in "SHEP". A more
efficient search would find not only "SHE" but also "HE"
at the same time; in effect the search would become
"no back-up". That is, as the algorithm examines each
character of the input subject string it never backs up. Such
a search can be achieved through the design and construction
of the data structure.

The  data  structure  herein  described  which  allows  a no 
back-up  search  will  be  called  a finite  state  automata  (f.s.a.) 
linked  list.  Its  concept  is  taken  from  an  article  by 
Alfred  Aho  and  Margaret  Corasick  (Ref  1).  Briefly,  in  a 
finite  state  automata,  at  any  given  time,  one  is  aware  of 
only  two  things:  (1)  the  current  state,  and  (2)  the  current 
input  character. 

    Subject string: S H E P

    Step  Algorithm                        Char   Result
      1   Begins search with S
      2   INDEX(S)=3, Tries                 S     match
      3   SUCCESSOR(3)=4, Tries             H     match
      4   SUCCESSOR(4)=5, Tries             E     match, PN2
      5   SUCCESSOR(5)=0, Fails on
          successor link
      6   Moves to next input char (H)
      7   INDEX(H)=1, Tries                 H     match
      8   SUCCESSOR(1)=2, Tries             E     match, PN1
      9   SUCCESSOR(2)=0, Fails on
          successor link
     10   Moves to next input char (E)
     11   INDEX(E)=0, Fails on index
     12   Moves to next input char (P)
     13   INDEX(P)=0, Fails on index
     14   Moves to next input char                DONE

Figure 11. Search with Alternate/Successor Linked List

An actual implementation of an f.s.a. data structure
will be illustrated later. But in figure 12 there is an
example of the three "functions" required to realize the
f.s.a. approach in the pattern matching algorithm. These
functions are the GOTO, FAIL, and OUTPUT functions.

The GOTO function determines which state to "go to"
given a current state and input character. For example, in
figure 12 (a), GOTO(0,H) equals 1; that is, given state 0
and input character "H", the next state to enter is 1.

The second function, FAIL, is used to determine which
state to enter, given that the GOTO function has failed.
(The GOTO function fails when it is not defined for a given
character and state.) For example, GOTO(4,X) fails, and
FAIL(4) equals 1; therefore, state 1 is the state to enter
from state 4 if the input character is "X". (Notice that
this is also the case for any input character other than
an "E" in state 4.)

The OUTPUT function is referenced as each state is
entered to determine if a pattern match has occurred. As
shown in figure 12 (c), OUTPUT may signal that one or more
patterns, or no patterns, have been matched given a current
state. For example, OUTPUT(6) is the null set, but OUTPUT(5)
is the set containing both "SHE" and "HE". Therefore, when
state 6 is entered there is no pattern match, but when state 5
is entered both patterns "SHE" and "HE" have been matched.

Figure 13 shows the relatively simple search using the
same input subject string and patterns as in previous examples.
However, preparing the data structure and determining the
three functions is much more complicated and costly, as will
be shown later.
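
The three functions can be illustrated with a short sketch of the standard Aho-Corasick construction (Ref 1). This is an assumption-laden illustration and not the thesis's implementation: GOTO is held as one dictionary per state, FAIL and OUTPUT as lists, and the failure values are the textbook ones, so individual entries may differ from those drawn in figures 12 and 13 even though the matches found are the same.

```python
from collections import deque


def build_fsa(patterns):
    """Return the GOTO, FAIL and OUTPUT tables; state 0 is the start state."""
    goto, fail, output = [{}], [0], [set()]
    for pattern in patterns:                    # GOTO: one path per pattern
        state = 0
        for c in pattern:
            if c not in goto[state]:
                goto.append({}); fail.append(0); output.append(set())
                goto[state][c] = len(goto) - 1
            state = goto[state][c]
        output[state].add(pattern)
    queue = deque(goto[0].values())             # FAIL: breadth-first over states
    while queue:
        state = queue.popleft()
        for c, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and c not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(c, 0)
            output[nxt] |= output[fail[nxt]]    # inherit matches ending here
    return goto, fail, output


def search_fsa(fsa, text):
    """Scan the subject string once, never backing up."""
    goto, fail, output = fsa
    state, matches = 0, []
    for i, c in enumerate(text):
        while state and c not in goto[state]:
            state = fail[state]                 # GOTO failed: consult FAIL
        state = goto[state].get(c, 0)
        for pattern in sorted(output[state]):   # consult OUTPUT on entry
            matches.append((i - len(pattern) + 1, pattern))
    return matches


fsa = build_fsa(["HE", "SHE", "HAT", "THEY"])
print(search_fsa(fsa, "SHEP"))                  # [(1, 'HE'), (0, 'SHE')]
```

With the patterns entered in this order the sketch reproduces the values quoted above: GOTO(0,H) is 1, FAIL(4) is 1, OUTPUT(6) is empty, and OUTPUT(5) contains both "SHE" and "HE", so both patterns are reported on the single pass over "SHEP".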

Figure 12. Functions to Implement Finite State Automata Algorithm

    Subject string: S H E P

    Step  Algorithm                        Results
      1   Begins search with S
      2   INDEX(S)=3                       OUTPUT(3)=null
      3   Moves to next character (H)
      4   GOTO(3,H)=4                      OUTPUT(4)=null
      5   Moves to next character (E)
      6   GOTO(4,E)=5                      OUTPUT(5)=SHE, HE
      7   Moves to next character (P)
      8   GOTO(5,P)=fail
      9   FAIL(5)=0
     10   Moves to next character          DONE

Figure 13. Search with Finite State Automata Data Structure

Thus, it appears that the pattern matching algorithms
associated with the data structures so far discussed may be
ranked from fastest to slowest:

1. Search using finite state automata linked list.
2. Search using indexed alternate/successor linked list.
3. Search using indexed binary tree.
4. Search using indexed linked list.
5. Search using indexed simple list.
6. Search using simple list.

However,  the  ranking  of  the  size  of  the  data  structures 
is  also  in  the  same  order,  from  largest  to  smallest.  At  first, 
one  might  suggest  packing  the  patterns  (placing  more  than  one 
character  per  computer  word  of  storage)  as  a solution  to  this 
problem.  And  as  shown  in  figure  14,  a packed  simple  list 
does  indeed  save  space  over  the  unpacked  version,  eight  words 
compared  to  twenty.  (Recall  figure  2.)  But  now  each  packed 
character  must  be  unpacked  in  order  to  perform  comparisons 
during  the  matching  process. 

The question, then, is: can a savings in space be made
without an unacceptable sacrifice in speed? In fact, can the
speed of the matching algorithm actually be improved by
taking advantage of the fewer accesses which must be made
to the data structure? For example, only two words must be
retrieved to obtain the pattern "HAT" and related values in
the packed example of figure 14, whereas five words must be
retrieved in the unpacked version of figure 2. Another
thought: perhaps the match algorithm can perform
multi-character comparisons without unpacking the pattern and
thereby, too, increase speed.
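
The packing trade-off can be made concrete with a small sketch. It assumes a 60-bit word holding ten 6-bit character codes, which is the word format generally documented for CDC 6000-series and CYBER 70 machines; the character code used (A=1 through Z=26) and all function names are purely illustrative, and a Python integer stands in for a machine word.

```python
BITS = 6          # assumed bits per character code
PER_WORD = 10     # assumed characters per 60-bit word


def code(c):
    return ord(c) - ord("A") + 1        # illustrative code: A=1 ... Z=26


def pack(chars):
    """Pack up to ten characters, left-justified, into one integer word."""
    word = 0
    for c in chars:
        word = (word << BITS) | code(c)
    return word << (BITS * (PER_WORD - len(chars)))


def unpack(word, count):
    """Recover the first `count` character codes by shifting and masking."""
    return [(word >> (BITS * (PER_WORD - 1 - i))) & 0x3F for i in range(count)]


def matches_packed(pattern_word, length, text_word):
    """Compare the first `length` characters of two words in one operation."""
    shift = BITS * (PER_WORD - length)
    return (pattern_word >> shift) == (text_word >> shift)


print(unpack(pack("HAT"), 3))                        # [8, 1, 20]
print(matches_packed(pack("HE"), 2, pack("HEP")))    # True
print(matches_packed(pack("HAT"), 3, pack("HEP")))   # False
```

The `unpack` function represents the extra work a character-at-a-time comparison must do on packed data, while `matches_packed` represents the multi-character comparison that avoids unpacking altogether.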

It  is  in  the  interest  of  resolving  these  questions  and 
ideas  that  the  thesis  work  was  conducted.  In  the  following 
chapter  the  various  approaches  used  during  the  thesis  will 
be  discussed.  In  particular,  machine  dependent  (CYBER  74) 
features  will  be  identified  and  an  attempt  to  quantify 
expected  results  will  be  made.  In  Chapter  III  the  test 
procedures  will  be  explained,  and  in  Chapter  IV,  a discussion 
of  the  results  and  their  implications  will  be  made. 
II. CYBER 74 Implementations

In  this  chapter,  the  many  programs  that  were  actually 
implemented  on  the  CDC  CYBER  74  will  be  presented.  There 
were  a total  of  twelve  different  configurations  based  on  type 
of  data  structure,  type  of  subject  input  (packed  or  unpacked), 
and  whether  the  data  structure  itself  was  packed  or  unpacked. 

Test Program Design

The  first  step  in  constructing  the  programs  was  to 
develop  a design  around  which  all  programs  could  be  built. 

It  was  necessary  that  this  design  allow  each  matching 
algorithm  (that  portion  of  the  program  which  searched  the 
subject  string  for  pattern  matches)  to  be  timed  separately 
from  input  and  output  constraints.  It  was  also  decided  that 
the  data  structure  should  be  able  to  be  modified  indepen- 
dently of  the  search  routine,  and  that  the  data  structure 
should  be  "hidden"  from  the  search  routine  as  much  as 
possible.  The  "bubble  chart"  (Ref  4)  of  the  resulting  design 
is  shown  in  figure  15. 

Figure 15. Bubble Chart for Test Program Design

The afferent (input) portion of the program consists of
the modules which access the input file and present to the
central transforms (processing modules) an input record.
This input record, in theory, can consist of any number of
characters, either packed or unpacked. This record may
be a pattern which is to be added to or deleted from the data
structure, or it may be a subject string (text) which is to
be searched for pattern matches. It is the function of the
central  transform  "BUILD  INFO  RECORD"  to  determine  which  of 
these  possibilities  the  input  record  fills  and  modify  the 
record  as  necessary  to  produce  "Pattern  Info"  or  "Text  Info" 
as  appropriate.  If  the  record  is  a pattern,  it  is  sent  to 
the  efferent  (output)  module  which  handles  modifications  to 
the  data  structure  along  with  the  associated  command 
indicating  the  pattern  is  to  be  added  to  or  deleted  from 
the  data  structure.  On  the  other  hand,  if  the  input  record 
is  text,  then  it  is  passed  to  the  other  central  transform 
"SEARCH  FOR  MATCH"  which  does  just  that,  i.e.,  processes 
the  input  record  looking  for  pattern  matches.  Match 
information  is  then  relayed  to  the  "OUTPUT  MATCH  INFO"  module 
which will identify the results of the matching algorithm.

The  corresponding  structure  chart  is  shown  in  figure  16. 
Here,  the  actual  program  elements  and  the  data  that  flow 
between  them  are  pictured.  As  implied  in  this  chart,  the 
executive  module  controls  the  overall  sequencing  and  hence 
execution  of  the  other  modules. 

The module "GETREC" and subordinate modules represent
the single afferent branch of the bubble chart, inputting
information to the executive. The two central transforms
are represented by "BINFO" and "SEARCH". Their functions
are those described for the corresponding "bubbles".

Notice that "SEARCH" accesses the data structure through
the module "GETPAT". In this manner the data structure is
effectively hidden and the coding for "SEARCH" will not
become involved in actual data structure manipulation.
(Admittedly, the search module is closely tied to the design
of the data structure.)

Figure 16. Structure Chart for Test Program Design

The remaining efferent branches of the bubble chart are
fulfilled by the modules "MODDS" and "PUTOUT". The subordinate
modules to "MODDS" represent those functions that may
be necessary when modifying the data structure--initialize,
add, or delete. Again, the data structure is hidden from
"MODDS" through "GETPAT" plus the module "PUTPAT". (It
should be mentioned that within the context of this thesis,
a deleted pattern will not necessarily free any storage space.
In any event, pattern deletions and associated "garbage
collection" procedures will not be considered during performance
evaluation.)

Thus,  the  initial  design  criteria  are  established  by 
the  structure  chart  in  figure  16.  All  programs  were  coded 
within  this  design  using  CDC  Fortran  IV  Extended.  Some 
departures  from  the  general  design  did  occur  and  will  be 
mentioned  during  the  description  of  the  implementation. 
Essentially,  only  the  modules  "SEARCH",  "GETPAT",  and 
"PUTPAT"  varied  from  one  program  to  another. 

In the remainder of this chapter, each program will be
uniquely identified by its six character name, e.g., EXEC1A.
The fifth character will always be a number and refers to the
type of data structure implemented. The sixth character will
be the letter "A", "B", or "C", referring to the type of input
and data structure used. Figure 17 lists these program names
and identifies their characteristics.

Program   Type of                           Input      Data
Name      Data Structure                    Text       Structure

EXEC1A    Simple List                       Unpacked   Unpacked
EXEC2A    Indexed Simple List               Unpacked   Unpacked
EXEC3A    Indexed Binary Tree               Unpacked   Unpacked
EXEC4A    Alternate/Successor Linked List   Unpacked   Unpacked
EXEC6A    Finite State Automata             Unpacked   Unpacked
EXEC1B    Simple List                       Unpacked   Packed
EXEC2B    Indexed Simple List               Unpacked   Packed
EXEC3B    Indexed Binary Tree               Unpacked   Packed
EXEC4B    Alternate/Successor Linked List   Unpacked   Packed
EXEC5B*   Alternate/Successor Linked List   Unpacked   Packed
EXEC3C    Indexed Binary Tree               Packed     Packed
EXEC4C    Alternate/Successor Linked List   Packed     Packed

*- Data structure for EXEC5B is same as EXEC4B; however, the two
   differ in method of search. (See text.)

Figure 17. Programs Developed to Test Pattern Matching Algorithms


Unpacked  Approaches 

EXEC1A (Simple List). The first program developed,
EXEC1A, used a simple list data structure, and both the input
subject text and the data structure were unpacked. This
approach was chosen as the first implementation for several
reasons:

1. It was simple,

2. A simple program was needed to validate the initial
software design,

3.  It  would  provide  a base  upon  which  other  programs 
could  be  built  and  compared,  and 

4.  Its  searching  logic  closely  approximates  an 
intuitive  approach. 

The resulting data structure of EXEC1A was identical to
the one pictured in figure 2 of the last chapter. Each
pattern was entered into the data structure, one character
per word, right justified, with zero fill. Thus, each six
bit character was essentially an integer value (0-63).
(Appendix  A shows  the  CDC  display  code  for  legal  characters.) 
With  each  pattern  there  were  also  two  words  storing  the 
values  for  pattern  number  and  pattern  length. 

The  search  algorithm  was  coded  to  perform  the  matching 
function  as  shown  in  figure  3.  Since  the  input  text  was  also 
unpacked  (right  justified,  zero  fill)  character  comparisons 
were  simple  checks  for  integer  equality.  Due  to  the  simplicity 
of this method, there were no difficulties in implementation.

EXEC2A (Indexed Simple List). As an improvement to the
simple  list,  EXEC2A,  with  its  indexed  list,  was  the  logical 
choice  to  implement  next.  The  index  value  for  a given  pattern 
was  chosen  to  be  the  display  code  value  of  the  first  character 
plus  one.  Thus,  the  index  values  ranged  from  one  to  sixty- 
four,  and  so,  the  index  table  was  established  as  sixty-four 
words  long.  Its  function  then,  was  as  illustrated  in 
figure  4,  to  point  to  the  first  occurrence  of  a pattern  with 
a given  first  character. 

The search algorithm coding for EXEC2A was almost identical
to EXEC1A. The only change was to reference the index
table  when  looking  for  a possible  first  character  match  for 
each  new  input  character. 
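
The indexed simple list may be illustrated with a minimal sketch in Python (not the Fortran of the thesis). It assumes a stand-in six bit character code (A-Z as 1-26, blank as 45; Appendix A gives the actual table), a flat list of integer "words" holding pattern number, pattern length, and one character per word, and a search that scans from the indexed entry to the end of the list. The names `code_of`, `build`, and `search` are hypothetical.

```python
def code_of(ch):
    """Stand-in 6-bit character code (assumed): A-Z -> 1-26, blank -> 45."""
    if ch == " ":
        return 45
    return ord(ch) - ord("A") + 1


def build(patterns):
    words = []           # unpacked list: number, length, then one char per word
    index = [0] * 65     # entries 1..64 used; 0 means "no such first character"
    for number, pat in enumerate(patterns, start=1):
        slot = code_of(pat[0]) + 1          # character code plus one
        if index[slot] == 0:
            index[slot] = len(words) + 1    # 1-based position of first such pattern
        words += [number, len(pat)] + [code_of(c) for c in pat]
    return words, index


def search(text, words, index):
    codes = [code_of(c) for c in text]
    matches = []
    for start, first in enumerate(codes):
        pos = index[first + 1]
        if pos == 0:
            continue                        # no pattern begins with this character
        i = pos - 1
        while i < len(words):               # scan onward from the indexed entry
            number, length = words[i], words[i + 1]
            if codes[start:start + length] == words[i + 2:i + 2 + length]:
                matches.append((start, number))
            i += 2 + length
    return matches


words, index = build(["HE", "SHE", "HAT", "THEY"])
print(search("SHE HAT", words, index))
```

The only difference from a plain simple list search is the table lookup that skips input characters which cannot begin any pattern.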

EXEC3A (Indexed Binary Tree). In Chapter I, an indexed
linked list followed the indexed list in order of expected
efficiency. However, EXEC3A implements an indexed binary
tree, skipping the linked list. Time constraints required
that one implementation be omitted from the thesis, and the
indexed linked list was the one eliminated; the choice was
arbitrary, based on greater interest in the other
implementations.

The  indexed  binary  tree  data  structure  of  EXEC3A  was 
similar  to  the  one  pictured  in  figure  8.  As  in  the  other 
implementations,  each  pattern  character  was  stored  right 
justified,  zero  filled,  in  a word  of  storage.  High  and  low 
pointer  values  were  established  based  on  comparison  of  second 
character display code values. That is, if a new pattern were
entered in the data structure and its second character were
greater than or equal to the second character of an existing
pattern, then the new pattern would be added out the high
pointer. Otherwise, it would be added out the low pointer.
If the new pattern were only a single character, then it was
placed  at  the  head  of  the  tree  and  tied  to  the  index  table 
regardless  of  existing  patterns.  Figure  18  shows  a set  of 
patterns  and  the  binary  tree  that  would  result  if  the  data 
structure  building  algorithm  of  EXEC3A  were  applied.  (A  note 
to  the  reader--the  figure  is  not  meant  to  imply  a packed  data 
structure,  just  that  with  each  pattern  there  are  associated 
high  and  low  pointers.) 
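
The insertion rule just described can be sketched as follows. This is a minimal Python illustration, not the EXEC3A code. It assumes a dictionary in place of the index table, compares letters directly (which orders A-Z the same way as their display code values), and leaves out single-character patterns. The tree it builds follows the stated rule but has not been checked against figure 18. The names `Node` and `insert` are hypothetical.

```python
class Node:
    def __init__(self, number, pattern):
        self.number = number
        self.pattern = pattern
        self.high = None   # patterns whose second character is >= this one's
        self.low = None    # patterns whose second character is <  this one's


def insert(index, number, pattern):
    if len(pattern) < 2:
        raise ValueError("single-character patterns are outside this sketch")
    node = Node(number, pattern)
    current = index.get(pattern[0])
    if current is None:
        index[pattern[0]] = node        # first pattern with this first character
        return node
    while True:
        if pattern[1] >= current.pattern[1]:
            if current.high is None:
                current.high = node
                return node
            current = current.high
        else:
            if current.low is None:
                current.low = node
                return node
            current = current.low


index = {}
for n, p in enumerate(["HIGH", "HOT", "HEM", "HIDE", "HAM", "HOPE", "HAT"], start=1):
    insert(index, n, p)

root = index["H"]
print(root.pattern, root.high.pattern, root.low.pattern)
```

With this entry order, HIGH heads the tree, HOT hangs from its high pointer, and HEM from its low pointer; the remaining patterns descend from those two.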

The  choice  of  second  character  to  determine  high  or  low 
order  was  perhaps  not  the  best,  since,  if  there  were  many 
patterns  all  with  the  same  first  and  second  character,  then 
the  binary  tree  would  become  merely  a linked  list.  However, 
the  expected  input  during  the  test  was  not  to  be  of  this  form 
and  therefore,  would  not  require  a "better"  hash  function. 

The search algorithm for EXEC3A was coded to perform
the pattern matching algorithm using the unpacked binary tree
data structure and unpacked input text.

Patterns and the order entered into data structure are:

1. HIGH
2. HOT
3. HEM
4. HIDE
5. HAM
6. HOPE
7. HAT

HIPTR- High Pointer

Figure 18. Binary Tree as Would be Constructed by EXEC3A.

EXEC4A (Alternate/Successor Linked List). EXEC4A was
developed to use the alternate/successor linked list data
structure described in Chapter I. As mentioned then, this
structure is much more complex than the others so far
developed; two pointer words and one pattern number word are needed
for each character stored in the data structure. Because of
this "extra" information, slight departures were made from
the initial program design of figure 16.

In  particular,  the  module  "ADDPAT"  was  given  direct 
access  to  the  data  structure  rather  than  working  solely 
through  the  data  hiding  modules  "GETPAT"  and  "PUTPAT".  As 
a matter  of  fact,  "GETPAT"  and  "PUTPAT"  too,  were  changed 
for  EXEC4A.  Their  new  functions  were  to  manipulate  entries 
corresponding  to  single  characters  rather  than  to  work  with 
entire  patterns. 

The  search  routine,  however,  presented  no  departures 
from  the  original  design.  In  fact,  the  complexity  of 
constructing  the  data  structure  actually  had  an  inverse 
effect  on  the  search  module,  decreasing  the  difficulty  of 
its  coding. 

Packed  Approaches 

After EXEC4A was implemented, rather than proceeding to
the more complex finite state automata data structure, it
was decided that more advantage should be taken of the CDC
hardware given the then completed programs, EXEC1A, EXEC2A,
EXEC3A, and EXEC4A. The obvious idea was to make use of the
large (sixty bit) CDC computer word, that is, to pack
information. In this way, up to ten characters could be placed
in a single word, thereby reducing data structure storage
requirements and, it was hoped, decreasing search time through
fewer accesses to the data structure.

Another hardware idea worth testing would be to make
more efficient use of the CPU registers
during the search process. However, this would most certainly
require machine level coding (COMPASS), and it was decided
that this avenue would not be pursued at this time.

EXEC1B (Simple List). In making use of a packed data
structure, EXEC1A was the first program to be modified,
producing the new program, EXEC1B. Figure 19 is the resulting
data structure produced by EXEC1B. When this is compared to
the unpacked version in figure 2, one can see that indeed,
much space can be saved, especially when dealing with long
patterns such as number five, "LONGER THAN TEN". Notice also,
that the length of the pattern in both words (LW) and
characters (LC) along with the pattern number are packed
into a single word.

The pattern is packed left justified, and the fill is
unimportant. In this way a particular character may be
selected by a simple left circular shift and mask operation
based on a desired character position. For example, in
pattern number four, if one wished to examine character
position three, the Fortran Extended statement would be:

      CHAR=SHIFT(PATTERN,CHARPOS*6).AND.(.NOT.MASK(54))

where PATTERN is the pattern "THEY", and CHARPOS has the
value three. After execution of this statement, "CHAR"
would contain the single value representing the letter "E",
which is the desired answer.
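
The shift and mask operation can be traced with a minimal Python model of the sixty bit word. This is a sketch, not the thesis code: `shift` models a left circular shift, `mask(n)` models a word with n leading one bits, and `code_of` is a stand-in character code (A-Z as 1-26, blank as 45) used in place of the display code table of Appendix A. All function names are illustrative.

```python
WORD_BITS = 60
WORD_MASK = (1 << WORD_BITS) - 1


def code_of(ch):
    """Stand-in 6-bit character code (assumed): A-Z -> 1-26, blank -> 45."""
    if ch == " ":
        return 45
    return ord(ch) - ord("A") + 1


def pack_left(text):
    """Pack up to ten characters left justified into one word, zero fill."""
    word = 0
    for ch in text:
        word = (word << 6) | code_of(ch)
    return word << (6 * (10 - len(text)))


def shift(word, n):
    """Left circular shift of a 60-bit word by n bits."""
    n %= WORD_BITS
    return ((word << n) | (word >> (WORD_BITS - n))) & WORD_MASK


def mask(n):
    """Word whose n leading (leftmost) bits are ones."""
    return (WORD_MASK >> (WORD_BITS - n)) << (WORD_BITS - n)


def char_at(word, charpos):
    # Rotate the wanted character into the low six bits, then keep only those.
    return shift(word, charpos * 6) & (~mask(54) & WORD_MASK)


pattern = pack_left("THEY")
print(char_at(pattern, 3) == code_of("E"))   # True
```

Rotating by CHARPOS*6 bits brings the first CHARPOS characters around to the low end of the word, so the character at the desired position lands in the rightmost six bits.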

The search algorithm then, was written to perform this
unpacking operation as each character of a pattern was
needed. Otherwise, its operation was identical to that of
the algorithm in EXEC1A.

Pattern #   Pattern

1           HE
2           SHE
3           HAT
4           THEY
5           LONGER THAN TEN

LW- Length of pattern in words
LC- Length of pattern in characters
PN- Pattern number

Figure 19. Packed Simple List Data Structure of EXEC1B.

EXEC2B (Indexed Simple List). The next program to incorporate
a packed data structure was EXEC2B. Like EXEC1B, EXEC2B
was a redesign of its unpacked counterpart. The resulting
data structure was identical to that of EXEC1B with the
addition, of course, of the index table. The index table
could itself have been packed, say, by placing four index
values per word. However, it was decided not to do this
since the overhead in obtaining a packed index value would
not be worth the minimal savings in space.

There  were  no  problems  in  implementing  EXEC2B. 

EXEC3B (Binary Tree). EXEC3B is the packed data structure
version of EXEC3A. As in EXEC2B, the index table was
not packed, and each pattern had its related information
packed into a single word. Figure 20 shows the packed
structure of EXEC3B. One can see that each pattern is stored
left justified as in the other packed implementations.

Therefore, the resulting change to the search module
of EXEC3A in order to work for EXEC3B was the same shift and
mask operation described for EXEC1B.

PN- Pattern number
LW- Length of pattern in words
LC- Length of pattern in characters
HP- High pointer
LP- Low pointer

Figure 20. Packed Binary Tree Data Structure of EXEC3B

EXEC4B (Alternate/Successor Linked List). The changes
thus far described to achieve packed data structures have
been relatively easy to implement, though not trivial.
However, achieving a packed version of the alternate/successor
linked list was not nearly so simple. The first
question that came to mind was just what do you pack, or
more appropriately, what can be packed?

The  answer  to  this  question  was  an  involved  approach 
that  allowed  as  much  of  a pattern  to  be  packed  as  possible. 
Specifically,  shared  leading  characters  of  two  or  more 
patterns  were  packed  in  a word  and  the  remaining  unique 
characters  of  each  pattern  packed  in  separate  words.  The 
packed  portions  were  linked  by  successor  and  alternate  links 
as  in  EXEC4A.  Figure  21  shows  step  by  step  how  the  data 
structure  would  be  built  by  EXEC4B  for  the  given  patterns. 

One  can  see  that  the  data  structure  can  quickly  become 
complicated.  Nonetheless,  the  extra  time  spent  in  building 
the  data  structure  was  expected  to  provide  decreased  time 
in  the  actual  pattern  matching  algorithm. 

So,  at  this  time  all  unpacked  pattern  matching  programs 
had  been  converted  to  packed  operation.  However,  the  packed 
patterns,  once  retrieved  from  the  data  structure  were  still 
being  unpacked  for  comparisons  during  the  pattern  matching 
search.  With  up  to  ten  characters  available  for  a single 
comparison,  it  was  reasoned  that  perhaps  multi-character 
comparisons  could  be  achieved  using  packed  input  text  and 
thereby,  speed  up  the  search.  This  was  the  motivation  behind 
creating  EXEC3C  and  EXEC4C. 

Figure 21. Sample Data Structure Construction Steps for EXEC4B
(columns: ACTION, RESULTING DATA STRUCTURE)

EXEC3C (Binary Tree, Packed Input). The data structure
building and accessing modules for EXEC3C were used as
coded from EXEC3B. (Figure 20 shows the data structure.)
However, the search module was very much changed.

In EXEC3C the concept was to bring packed text into the
search module and search for pattern matches using direct
comparisons  between  the  text  and  packed  patterns;  there  was 
to  be  no  unpacking.  In  this  way,  up  to  ten  characters  could 
be  tested  for  equality  by  a single  comparison.  Also,  10,000 
characters  of  input  text  in  previous  implementations  required 
a 10,000  word  buffer  if  brought  in  all  at  once,  but  in  EXEC3C 
(and EXEC4C) the same number of characters could be brought
in with a 1,000 word buffer--a ten to one savings in space.
Equivalently,  given  the  same  size  buffer  area,  EXEC3C  need 
refill  it  only  a tenth  as  often  as  for  the  other  programs, 
another  possible  increase  in  search  performance. 

The search module was coded with these ideas in mind--
multi-character comparisons with packed input text and
patterns. The approach works fine when the pattern occurs
within the text in the same relative position as it is stored
in the data structure, i.e., left justified within a word of
storage. Figure 22(a) shows this concept. However, if the
pattern occurs across word boundaries, then either the text
must be unpacked and repacked to form an occurrence as in (a)
or the pattern must be "broken up" as in the text, and
therefore, two comparisons must be made. This latter
alternative was chosen for EXEC3C and is shown conceptually in
figure 22(b).

In practice, the search module for EXEC3C required three
times as much code as the longest search
module of the previous programs.

Packed Text:

   HERE IS A PATTERN IN A STRING.

Step 1. Get packed pattern from data structure . . . PATTERN
Step 2. MASK word two of input text to get . . . . . PATTERN
Step 3. Compare the two to get complete match

(a) Occurrence at Beginning of Text Word

Packed Text:

   HERE A PATTERN CROSSES A WORD.

Step 1. SHIFT and MASK word one of text to get . . . PAT
Step 2. MASK pattern to get . . . . . . . . . . . .  PAT
Step 3. Compare the two to get partial pattern match
Step 4. MASK (no SHIFT) word two of text to get  . . TERN
Step 5. SHIFT and MASK pattern to get  . . . . . . . TERN
Step 6. Compare the two to complete pattern match

(b) Occurrence Across Word Boundary

Figure 22. Pattern Matching as done in EXEC3C and EXEC4C.
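
The steps of figure 22 may be followed in a minimal Python sketch. It is an illustration under stated assumptions, not the EXEC3C search module: words are modelled as 60-bit integers, the character code is a stand-in (A-Z as 1-26, blank as 45, period as 47), patterns are at most ten characters, and the helper names (`pack_text`, `matches_at`, and so on) are hypothetical.

```python
WORD_BITS = 60
WORD_MASK = (1 << WORD_BITS) - 1


def code_of(ch):
    """Stand-in 6-bit code (assumed): A-Z -> 1-26, blank -> 45, period -> 47."""
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 1
    return {" ": 45, ".": 47}[ch]


def pack_left(text):
    word = 0
    for ch in text:
        word = (word << 6) | code_of(ch)
    return word << (6 * (10 - len(text)))      # left justified, zero fill


def shift(word, n):
    n %= WORD_BITS
    return ((word << n) | (word >> (WORD_BITS - n))) & WORD_MASK


def mask(n):
    return (WORD_MASK >> (WORD_BITS - n)) << (WORD_BITS - n)


def pack_text(text):
    padded = text + " " * (-len(text) % 10)    # blank fill the last word
    return [pack_left(padded[i:i + 10]) for i in range(0, len(padded), 10)]


def matches_at(words, k, pattern_word, length):
    """Test a packed pattern (length <= 10) at character offset k of packed text."""
    w, o = divmod(k, 10)
    n1 = min(length, 10 - o)                   # characters available in word w
    if (shift(words[w], o * 6) & mask(n1 * 6)) != (pattern_word & mask(n1 * 6)):
        return False
    n2 = length - n1                           # characters spilling into word w+1
    if n2 == 0:
        return True                            # case (a): a single comparison
    if w + 1 >= len(words):
        return False
    # Case (b): second comparison against the start of the next text word.
    return (words[w + 1] & mask(n2 * 6)) == (shift(pattern_word, n1 * 6) & mask(n2 * 6))


pat = pack_left("PATTERN")
text_a = pack_text("HERE IS A PATTERN IN A STRING.")
text_b = pack_text("HERE A PATTERN CROSSES A WORD.")
print(matches_at(text_a, 10, pat, 7), matches_at(text_b, 7, pat, 7))
```

In the first text the pattern begins word two and one comparison suffices; in the second it begins at the eighth character of word one, so "PAT" and "TERN" are compared separately.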

EXEC4C (Alt/Suc Linked List, Packed Input). EXEC4C was
implemented essentially by taking EXEC4B and replacing its
search  module  with  one  coded  similarly  to  EXEC3C.  Therefore, 
the  data  structure  of  EXEC4C  is  identical  to  that  shown  in 
figure  21,  and  its  search  algorithm  performed  as  described 
for  EXEC3C  and  as  illustrated  in  figure  22. 

Special  Approaches 

So,  with  the  completion  of  EXEC4C  there  were  ten  programs 
with  which  to  enter  testing.  However,  preliminary  analysis 
of  the  check-out  runs  of  these  programs  suggested  another 
approach  might  be  taken  resulting  in  a new  program,  EXEC5B. 

EXEC5B  (Special  Alt/Suc).  EXEC5B  was  a modification  to 
EXEC4B.  It  was  the  result  of  two  observations  made  during 
the  debug  runs  of  the  previously  completed  programs: 

1.  EXEC4B  appeared  to  be  producing  satisfactory  search 
times despite its possibly complex data structure.

2.  EXEC3C  and  EXEC4C  did  not  seem  to  be  doing  so  well 
with  the  packed  input  text  and  multi-character  comparisons. 

So, EXEC5B was designed making use of the data structure
of EXEC4B (same as EXEC4C) and combining the unpacked text
approach with the ability to perform multi-character
comparisons which motivated the creation of EXEC3C and EXEC4C in
the first place. The concept of this approach is shown in
figure 23. The unpacked input text is packed into a single
word, and patterns are compared against this packed word. If
the pattern match continues beyond the ten packed characters,
the comparisons are then made character by character as in
EXEC4B. On the other hand, if there are no pattern matches,
then new input characters are shifted into the temporary
word from the right and comparisons are made against this
new temporary word.

Figure 23. Concept of Search as Used in EXEC5B
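
The temporary-word idea can be sketched in Python as follows. This is a simplification, not EXEC5B itself: it uses a flat list of packed patterns rather than the alternate/successor structure of EXEC4B, a stand-in character code (A-Z as 1-26, blank as 45), and hypothetical names throughout. The temporary word always holds the next ten input characters, and each step shifts one new character in from the right.

```python
WORD_MASK = (1 << 60) - 1


def code_of(ch):
    """Stand-in 6-bit character code (assumed): A-Z -> 1-26, blank -> 45."""
    if ch == " ":
        return 45
    return ord(ch) - ord("A") + 1


def mask(n):
    return (WORD_MASK >> (60 - n)) << (60 - n)   # n leading one bits


def pack_left(text):
    word = 0
    for ch in text:
        word = (word << 6) | code_of(ch)
    return word << (6 * (10 - len(text)))        # left justified, zero fill


def search(text, patterns):
    codes = [code_of(c) for c in text]           # unpacked input text
    n = len(codes)
    packed = [(num, pack_left(p[:10]), p) for num, p in enumerate(patterns, start=1)]
    window = 0
    for j in range(9):                           # preload nine characters
        window = (window << 6) | (codes[j] if j < n else 0)
    matches = []
    for start in range(n):
        nxt = start + 9
        # Shift the next input character into the temporary word from the right.
        window = ((window << 6) & WORD_MASK) | (codes[nxt] if nxt < n else 0)
        for num, first_word, p in packed:
            if len(p) > n - start:
                continue                         # pattern would run past the text
            k = min(len(p), 10)
            if (window & mask(6 * k)) != first_word:
                continue                         # multi-character comparison failed
            # Beyond ten characters, continue character by character.
            if all(codes[start + i] == code_of(p[i]) for i in range(10, len(p))):
                matches.append((start, num))
    return matches


patterns = ["HE", "SHE", "HAT", "THEY", "LONGER THAN TEN"]
text = "THEY SAY SHE IS LONGER THAN TEN HATS"
print(search(text, patterns))
```

Each pattern of ten characters or fewer is accepted or rejected by one masked comparison against the temporary word; only "LONGER THAN TEN" needs the character by character continuation.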

With  the  completion  of  EXEC5B  there  were  eleven  machine 
dependent  pattern  matching  programs  implemented.  The  final 
program  to  be  built  was  designed  using  the  algorithmic 
description  of  the  finite  state  automata  approach  presented 
by  Aho  and  Corasick  (Ref  1). 

EXEC6A (f.s.a.). The finite state automata approach of
EXEC6A was achieved using a modified version of the data
structure in EXEC4A--the linked list. And the search
algorithm was coded to perform the three functions, GOTO, FAIL,
and OUTPUT as described in Chapter I.

Construction  of  the  data  structure,  from  a simplistic 
viewpoint,  was  a two  step  process.  First,  patterns  were 
entered  into  the  data  structure  using  slightly  modified 
routines  from  EXEC4A.  Then,  a special  module  was  invoked 
which  calculated  the  remaining  values  to  be  stored  in  the 
data structure; these were the FAIL and OUTPUT values for
each state.

Figure 24 shows the finite state automata data structure
that would be constructed using the algorithm of EXEC6A.
(This may be compared to figure 10 which shows the equivalent
alternate/successor linked list data structure.) In figure 24
each row corresponds to a state. The character associated
with that row is the character which, if matched, would cause
the search algorithm to GOTO the state in SUCCESSOR of that
row. For example, if the search algorithm is in state (row)
five and it has an "H" then, it would GOTO state six
since SUCCESSOR(5) equals six. Then, while in state six, if
the search algorithm did not have an "E" then, it would FAIL
to state two where it would look for "E" again, which it does
not have, and would finally FAIL to state zero. In this last
example a weakness in EXEC6A is illustrated; the data structure
does not eliminate possible redundant failure transitions.
Authors Aho and Corasick, however, discuss in their
article (Ref 1) how such a "deterministic" finite state
automata may be generated. It was believed that
this refinement would net only minor benefits in search time, so
its implementation is offered as a recommendation for further
study.

The OUTPUT function of the search algorithm operated by
signalling that a match had occurred if a state were entered and
the corresponding pattern number was not zero. Thus, if
state seven were entered, then pattern number two would be
signalled as matched. Also, CONCURRENT OUTPUTS (also in figure 24)
identifies whether another pattern is matched at the same time.
For example, if the pattern "SHE" is matched so,
also, is the pattern "HE". That is why the CONCURRENT OUTPUTS
entry for pattern number two is equal to one. Thus, when
pattern number two is signalled as found, so is pattern
number one.
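
The GOTO, FAIL, and OUTPUT functions can be illustrated with a minimal Python sketch of the Aho and Corasick construction in its textbook form. It is not the linked list representation of EXEC6A: dictionaries stand in for the SUCCESSOR and ALTERNATE links, the state numbers differ from those of figure 24, and each state's output list plays the part of the pattern number together with its CONCURRENT OUTPUTS. The names `build` and `search` are hypothetical.

```python
from collections import deque


def build(patterns):
    goto = [{}]        # goto[state][char] -> next state
    output = [[]]      # pattern numbers signalled on entering each state
    for number, pat in enumerate(patterns, start=1):
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                output.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        output[state].append(number)

    # Second step: compute FAIL breadth first and merge concurrent outputs.
    fail = [0] * len(goto)
    queue = deque(goto[0].values())
    while queue:
        r = queue.popleft()
        for ch, s in goto[r].items():
            queue.append(s)
            f = fail[r]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[s] = goto[f].get(ch, 0)
            output[s] = output[s] + output[fail[s]]
    return goto, fail, output


def search(text, goto, fail, output):
    state = 0
    found = []
    for pos, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]              # FAIL transition
        state = goto[state].get(ch, 0)       # GOTO transition (or stay at zero)
        for number in output[state]:         # OUTPUT, including concurrent ones
            found.append((pos, number))      # pos is the last character matched
    return found


goto, fail, output = build(["HE", "SHE"])
print(search("USHERS", goto, fail, output))
```

With "HE" as pattern one and "SHE" as pattern two, entering the final state of "SHE" signals pattern two and, concurrently, pattern one, as described above.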

Column headings: INDEX TABLE, PATTERN, CHARACTER, SUCCESSOR,
ALTERNATE, FAIL, CONCURRENT

*- The negative value indicates that this state has no character
which may cause a GOTO transition.

Figure 24. Sample of Finite State Automata Data Structure of EXEC6A

So, with implementation of EXEC6A the twelve machine
dependent  pattern  matching  programs  listed  in  figure  17 
were  complete  and  ready  for  evaluation  testing. 

But  before  proceeding  to  the  next  chapter,  it  should  be 
mentioned  that  during  the  research  for  this  thesis  another 
algorithm  was  found  (Ref  2).  Its  design  revolves  around  a 
data  structure  that  is  built  specifically  for  a single 
pattern. Hence, it could not easily accommodate the
multi-pattern data structures of this thesis. However, as will be
reiterated in the recommendations chapter, this algorithm
should be investigated and compared to the results of this
work.

III. Testing Procedures

In this chapter the criteria used to evaluate the performance
of the pattern matching programs will be discussed,
and how the statistics were gathered will be described.

First, one should recall the pattern matching problem.
It is, given a set of patterns (in a data structure), to find
all occurrences of these patterns in a given input subject
string. For all practical purposes this input string may be
considered text, but this "text" could well be source code
and the pattern matcher a compiler, as described in Chapter I.

However,  the  efficiency  of  the  pattern  matching  program 
can  become  very  data  dependent.  That  is,  if  the  input  text 
were  Fortran  source  code,  one  might  expect  a higher  incidence 
of  patterns  beginning  with  a particular  character  than  say, 
might  be  found  in  regular  English  text.  (As  an  example, 
many  programmers  tend  to  choose  variable  names  that  begin 
with the same letter, e.g., INAME, IADDR, ISSAN, etc.) Such
a situation would lessen the value of an indexed data
structure based on first characters as has been done in all
the programs of this thesis. Another thought: how does the
number  of  patterns  affect  the  speed  of  the  search  from  one 
approach  to  another?  Also,  does  the  expected  frequency  of 
occurrence  (incidence)  have  any  effect?  With  these  ideas  in 
mind,  a test  plan  was  developed. 

It was decided that test data should include both
English text and some programming language code--Fortran
was chosen. And in order to minimize inherent peculiarities
attributable to construction of English or Fortran text, it
was also decided that a text of random combinations of
characters should be used. Therefore, three text files were
chosen to be used during testing: English text, Fortran
source code, and random text.

Next,  suitable  choices  for  patterns  had  to  be  made.  It 
was  decided  that  the  following  pattern  combinations  should 
be  made: 

1.  Large  number  of  patterns  occurring  often  in  the 
input  text.  (LH-  large  number,  high  incidence) 

2. Small number of patterns occurring often in the
input  text.  (SH-  small  number,  high  incidence) 

3.  Large  number  of  patterns  occurring  rarely  in  the 
input  text.  (LL-  large  number,  low  incidence) 

4.  Small  number  of  patterns  occurring  rarely  in  the 
input  text.  (SL-  small  number,  low  incidence) 

Therefore,  the  three  input  texts  were  analyzed  and 
frequency  counts  made  on  occurrences  of  individual  words  and 
particular  combinations  of  characters.  From  the  results  of 
this work, the twelve pattern files to be used were
constructed. During testing each of these pattern files was
combined with the appropriate text file, English, Fortran, or
random, to form twelve distinct test files. The original
fifteen files (three text and twelve pattern) are described
in figure 25.


At  this  time  it  was  necessary  to  identify  the  performance 
measures  that  would  be  needed  in  later  evaluation.  The 
single  obvious  measure  was  the  time  required  for  the  search 
module  to  complete  the  pattern  matching  process.  After  all, 
this  is  what  was  to  actually  determine  the  "fastest"  algorithm. 
However, it would be equally important to have some idea as
to  why  one  algorithm  was  faster  than  another.  So,  other 
measures  were  identified.  These  included  the  number  of  times 
the  data  structure  was  accessed  during  the  search  and  the 
number  of  words  returned.  These  measures  would  indicate  the 
amount  of  work  involved  in  communicating  with  the  data 
structure.  Also  included  were  the  total  number  of  equality 
checks  made  and  the  number  of  successful  ones  which  determined 
a pattern  match.  These  measures  would  provide  a relative 
efficiency  for  the  various  search  algorithms. 
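To make the equality-check measures concrete, the following is a minimal sketch of an instrumented search over a simple list of patterns. It is written in Python rather than the Fortran of the test programs, and the names `Counters` and `simple_list_search` are hypothetical. It assumes non-empty patterns and that every pattern is tried at every text position, which may differ from the thesis programs.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    match_checks: int = 0      # equality checks between text and pattern
    check_successes: int = 0   # equality checks that succeeded
    patterns_found: int = 0    # complete pattern matches


def simple_list_search(text, patterns):
    """Try every pattern at every text position, counting the work done."""
    c = Counters()
    for start in range(len(text)):
        for pat in patterns:
            matched = True
            for offset, ch in enumerate(pat):
                pos = start + offset
                if pos >= len(text):          # ran off the end of the text
                    matched = False
                    break
                c.match_checks += 1
                if text[pos] == ch:
                    c.check_successes += 1
                else:
                    matched = False
                    break
            if matched:
                c.patterns_found += 1
    return c


print(simple_list_search("abab", ["ab", "b"]))
```

Even in this small case the number of successful checks (6) differs from the number of patterns found (4), since a multi-character pattern needs several successful checks for one match.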

Another proposed measure was the total number of comparisons
made during the search, which would give an idea of the
general "overhead" processing. (This last measure was chosen
because of the relatively large amount of CPU time required
for a comparison operation.) Two other measures chosen were
the amount of time spent in constructing the data structure
and the size of the completed structure. These last two
would be especially important factors for a dynamic pattern
structure, as will be discussed later.

TEXT FILES

File Name   Description

FORTRAN     237 lines of Fortran source code totalling 22,201
            characters. (Trailing blanks at end of each line not
            counted.)

ENGLISH     266 lines of English text totalling 15,015 characters.
            (Trailing blanks at end of each line not counted.)

RANDOM      200 lines of random text totalling 15,762 characters
            made up of 22 unique characters.

PATTERN FILES

File     Number of   Total Number of   Number of Unique
Name     Patterns    Characters        First Characters
                     In Patterns       In Patterns

LHFOR       50           297                20
SHFOR       10            34                10
LLFOR       50           361                19
SLFOR       10            72                 6
LHENG       50           312                19
SHENG       10            26                 7
LLENG       50           373                20
SLENG       10           101                 9
LHRAN       50           182                21
SHRAN       10            14                 8
LLRAN       50           199                21
SLRAN       10            31                 9

LH- Large number, High incidence
SH- Small number, High incidence
LL- Large number, Low incidence
SL- Small number, Low incidence

Figure 25.  Files Created for Evaluation Testing

In all, nine different performance evaluation measures
were chosen and the appropriate statistics-gathering statements
were placed in each of the twelve programs. These nine
measures  are  described  in  figure  26.  Also  listed  in  figure  26 
are  two  calculable  measures.  One  provides  an  average  number 
of  match  comparisons  made  to  find  a pattern.  The  other  gives 
an  indication  of  the  average  amount  of  space  required  to 
store  a pattern.  Both  these  measures  were  expected  to  vary 
widely  depending  on  type  of  input  text  and  pattern  choice. 

The  final  test  plan,  then,  was  to  execute  each  of  the
twelve  programs  with  each  of  the  twelve  different  data  files 
resulting  in  a total  of  144  executions.  In  the  next  chapter 
the  results  of  these  runs  will  be  discussed. 
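The size of the test plan follows directly from pairing every program with every data file. The sketch below enumerates the pairs in Python. The program names are those used in this thesis; labelling each test file by its pattern file name (LHFOR, SHENG, and so on) is an assumption made for the example.

```python
from itertools import product

# The twelve test programs named in the text.
PROGRAMS = [
    "EXEC1A", "EXEC2A", "EXEC3A", "EXEC4A", "EXEC6A",
    "EXEC1B", "EXEC2B", "EXEC3B", "EXEC4B", "EXEC5B",
    "EXEC3C", "EXEC4C",
]

# Four pattern groups combined with three text files give twelve data files.
GROUPS = ["LH", "SH", "LL", "SL"]
TEXTS = ["FOR", "ENG", "RAN"]
DATA_FILES = [group + text for text in TEXTS for group in GROUPS]

# One execution per (program, data file) pair.
RUNS = list(product(PROGRAMS, DATA_FILES))
print(len(RUNS))
```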
Search Time-         That amount of time spent searching input text
                     for pattern matches.*

Construct Time-      That amount of time spent building the data
                     structure.*

Structure Size-      Minimum number of words to contain data
                     structure.

Words Returned-      Total number of words obtained from data
                     structure during search.

Match Checks-        Total number of equality checks made while
                     comparing patterns to text. (Can be character
                     or word comparisons depending on program.)

Check Successes-     Total number of successful equality checks,
                     which is by no means equal to number of
                     patterns found.

Patterns Found-      Total number of patterns found during search.

Search Comparisons-  Total number of comparisons made during the
                     search exclusive of those made implicitly
                     within iterative DO loops and intrinsic
                     functions such as MIN0.


Other Calculable Measures

Match Checks ÷ Patterns Found-        Average number of comparisons
                                      to find a pattern.

Structure Size ÷ Number of Patterns-  Average number of words
                                      required per pattern.


*- According to CDC these times are accurate to .01 seconds.


Figure 26.  Performance Measures
IV.  Results and Conclusions


Results 

The numerical results of the testing are presented in
Appendix B. The tables there show the actual figures returned
from each execution of the twelve programs using the twelve
data files. They are presented in the appendix since a relative
evaluation is more important than actual comparison of
hundreds of numbers. In fact, Tables I, II, III, IV, and V
present just such rankings of the results. In these tables
a "1" identifies the algorithm(s) which returned the "best"
values for a given test, and a "12" identifies the algorithm
which returned the worst results. Algorithms which returned
identical values are given identical rankings.
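The ranking rule just described can be sketched in a few lines of Python. The function name `rank_with_ties` and the sample times are made up for illustration. The sketch also assumes that a tie leaves a gap in the ranks that follow (two programs tied for first are followed by rank 3); the text does not state which convention the tables use.

```python
def rank_with_ties(values):
    """Map each program to its rank, where 1 is the smallest (best) value
    and programs with identical values share the same rank."""
    ordered = sorted(values.values())
    # index() returns the first position of a value, so ties share a rank.
    return {name: ordered.index(v) + 1 for name, v in values.items()}


# Made-up search times in seconds, for illustration only.
times = {"EXEC6A": 1.5, "EXEC4A": 1.5, "EXEC1A": 70.0}
print(rank_with_ties(times))
```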

Perhaps  the  most  important  measure  of  performance  of  a 
pattern  matching  algorithm  is  the  amount  of  time  it  requires 
to  search  through  the  input  text.  Table  I presents  the 
rankings  of  the  search  times  for  the  twelve  programs  and 
twelve  input  files  (actual  times  in  Appendix  B).  It  is 
interesting  to  note  that  no  one  algorithm  performed  best  for 
all input cases. For example, EXEC6A (f.s.a.) did "best"
during the searches involving a large number of patterns
(there were fifty patterns) in the data structure, but for a
small number of patterns (ten) EXEC4B (alt/suc linked list)
was best. Also of interest, it appears that the incidence
of patterns had little influence on the comparative rankings.
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TABLE  I 




However, no matter the number of patterns or incidence,
several algorithms are clearly just plain slow. EXEC1A (simple
list) and EXEC1B (packed simple list) were the "winners"
in this category, chalking up times like seventy seconds
compared to less than two seconds for the fastest algorithm
given the same input. EXEC2A and EXEC2B did somewhat better
using an indexed simple list data structure, turning in times
one-third to one-half those of EXEC1A and EXEC1B.

For the other eight algorithms/data structures, the
times were all very close; the difference between number "1"
and number "8" was not quite a factor of 2. And in all cases,
the difference between number "1" and number "2" was never
greater than two to three, and often less.

Therefore,  the  job  of  choosing  a single  "best"  algorithm 
based  on  search  time  alone  is  difficult  to  do.  However, 
looking  at  Table  I,  the  choice  would  probably  be  made  between 
EXEC4A  (alt/suc),  EXEC6A  (f.s.a.),  and  EXEC4B  (packed  alt/suc). 
Close  together,  but  not  in  contention  for  first  place,  are 
EXEC3A  (binary  tree),  EXEC5B  (alt/suc,  special  search),  and 
EXEC4C  (alt/suc,  packed  input).  And  it  appears  that  EXEC3B 
(packed  binary  tree)  and  EXEC3C  (binary  tree,  packed  input) 
are out of the running (along with EXEC1A, EXEC2A, EXEC1B,
and EXEC2B).

Another measure which may influence the final choice is
the size of the data structure. Table II presents the relative
rankings for the various algorithms. An interesting
comparison may be made between these and the rankings for
search time. That is, the algorithm with the faster time
has the larger data structure (as was predicted). For example,
one of the slowest programs, EXEC1B (packed simple list), had
the smallest data structure, while EXEC6A (f.s.a.), one of the
fastest, had the largest data structure.

TABLE II

Size of Data Structure Rankings

Columns: PROGRAM, then rankings for the FOR, ENG, and RAN text
files under each of four pattern groups: large number of
patterns with high incidence, large number with low incidence,
small number with high incidence, and small number with low
incidence.

[Ranking values are not legible in the source.]

Figure 27 illustrates this relationship of search time
to data structure size. In preparing this chart only figures
for the top six programs were used, and the actual values
shown are averages. That is, each time and size "bar"
represents the average of six executions: low and high incidence
runs for each of the Fortran, English, and random text files.
Therefore, the relative relationships illustrated reflect typical
expected values. In any event, it is clear that for EXEC3A,
EXEC4A, and EXEC6A, as the size of the structure increased
the search time decreased. This, on the average, is true for
all cases: the larger the structure, the faster the search.

However, notice that some of the packed versions required
less storage space and were still faster than unpacked programs;
this is an important observation, almost contrary to what was
anticipated. For example, EXEC4B (packed alt/suc) and EXEC3A
(binary tree) fit such a situation. EXEC4B was always faster
than EXEC3A and always required less storage. But this speed
difference is attributable to the difference in the design
of the algorithms/data structures, not the packed/unpacked
approach. In fact, comparing again Table I and Table II, one
can see that between programs of the same design (EXEC1A and
EXEC1B, EXEC2A and EXEC2B, etc.) packing had little effect
on the overall rankings for search time. (Figure 27 shows
this limited difference between programs EXEC4A and EXEC4B.)
However, it is important to note that packing did not always
increase search time and, in some cases, even allowed a
decrease in time.

The idea of design brings up another measure which may
be used to evaluate overall efficiency of a given program.
This is the measure reflecting the number of equality checks
required to find patterns in the input text. The relative
rankings of this measure are shown in Table III. (A "1",
the best, required fewest checks.) One would expect these
rankings to be consistent with those of search time; that is,
the algorithm making the fewer checks should be faster than
another making more. However, with the exception of EXEC1A,
EXEC2A, EXEC1B, and EXEC2B, this is not necessarily the
situation.

For example, EXEC3B and EXEC3C, which pretty much held
down positions seven and eight in search time, rank well in
the number of match checks made. And of the original fast
trio (EXEC4A, EXEC6A, and EXEC4B) only EXEC6A (f.s.a.) shows
a consistently high ranking. Why then, this disparity
between search time and the number of match checks? That is,
why do the faster programs not necessarily make the fewer
match checks? Table IV may give some idea of the answer to
this question.
TABLE III

Rankings- Number of Match Checks
Made to Find Patterns In Text

Columns: PROGRAM NAME, then rankings for the FOR, ENG, and RAN
text files under each of four pattern groups: large number of
patterns with high incidence (LH), large number with low
incidence (LL), small number with high incidence (SH), and
small number with low incidence (SL).

[Ranking values are not legible in the source.]

Table IV gives the rankings for the total number of
comparisons made during the search. Recall that this measure
was chosen to provide some indication of the overall amount
of processing involved in the search process. And as can be
seen in Table IV, EXEC3B (packed binary tree) does appear to
have the highest amount of equality testing of the eight
fastest programs. On the other hand, EXEC3C (binary tree,
packed input) does not fare as badly, especially in the large
pattern structures. This is a contraindicative result when
compared to the search time ranking for EXEC3C, which places it
a solid "8". The observation of this result caused this
author to restudy the coding for EXEC3C. It was found that
three comparisons were not being counted as they should.
Therefore, because of the close agreement with search time
rankings for the other programs, it will be assumed that had
these comparisons been counted the correct relationships
would be reflected. Another reason for this claim is that
the processing for EXEC3C was actually more involved than
EXEC4C and should compare similarly as EXEC3B compares to
EXEC4B, a ratio of almost two to one.

One final measure which must be discussed before presenting
the conclusions reflects the amount of time spent
building the data structure. It is true that this time is
trivial when compared to search time (numbers in Appendix B).
However, if one is speaking of a dynamic data structure this
can become an important consideration. The rankings of this
measure are shown in Table V.
TABLE IV

Rankings- Number of Comparisons
Made During Search

Columns: PROGRAM NAME, then rankings for the FOR, ENG, and RAN
text files under each of four pattern groups: large number of
patterns with high incidence (LH), large number with low
incidence (LL), small number with high incidence (SH), and
small number with low incidence (SL).

[Ranking values are not legible in the source.]

From Table V one can see that there does not appear to
be a clear-cut "winner". However, it appears there are some
clear-cut "losers". For example, EXEC1A, EXEC1B, EXEC2A,
EXEC2B, and EXEC6A seem to fill up the last five rankings;
though admittedly, there are some exceptions for EXEC2A and
EXEC2B. These exceptions are due to the small number of
patterns in the data structure and the limited conflicts in
index value. That is, for all programs (except EXEC6A) the
time required to build the data structure is basically a
combination of first, verifying that a pattern does not already
exist in the structure, and then, second, adding the new
pattern to the structure. Therefore, because EXEC2A and
EXEC2B are so uncomplicated (using an indexed simple list),
they were able to store a small number of patterns very
quickly. (Note, this advantage disappears if several
patterns share the same index value.)
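The two construction steps just described (check for an existing copy, then add) can be sketched for an indexed simple list as follows. This is an illustrative Python sketch, not the EXEC2A or EXEC2B code; the name `add_pattern` and the use of a dictionary keyed on the first character are assumptions, and patterns are assumed to be non-empty.

```python
def add_pattern(index, pattern):
    """Add `pattern` to an indexed simple list keyed on its first character.

    Returns False if the pattern was already stored, True if it was added."""
    bucket = index.setdefault(pattern[0], [])
    # Step 1: verify the pattern is not already present. This is a linear
    # scan of every pattern sharing the same index value.
    if pattern in bucket:
        return False
    # Step 2: add the new pattern.
    bucket.append(pattern)
    return True


index = {}
for pat in ["DO", "DIMENSION", "CALL", "DO"]:
    add_pattern(index, pat)
print(index)
```

The scan in step 1 grows with the number of patterns that share a first character, which is why the speed advantage fades when many patterns collide on one index value.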

Now,  EXEC6A  was  just  mentioned  as  a special  case  in  its 
construction  of  the  data  structure.  This  is  because  after 
all  patterns  have  been  stored  in  the  data  structure,  then 
the  fail  values  for  each  state  must  be  calculated.  (Recall 
the  finite  state  automata  data  structure  discussed  in 
Chapters  I and  II.)  This  "extra"  time  is  approximately  the 
time  difference  for  EXEC4A  and  EXEC6A  to  construct  the  same 
data  structure.  On  the  average,  EXEC6A  required  three  times 
as  much  time;  and  sometimes  as  much  as  six  or  seven  times. 
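The extra pass that computes fail values can be sketched in the style of Aho and Corasick (Ref 1). The Python below is only an approximation of the idea: it stores the pattern structure as a list of dictionaries rather than the alternate/successor linked list used by EXEC6A, and the names `build_goto` and `compute_fail` are hypothetical.

```python
from collections import deque


def build_goto(patterns):
    """Store all patterns in a trie; goto[state][char] gives the next state.
    State 0 is the start state."""
    goto = [{}]
    for pat in patterns:
        state = 0
        for ch in pat:
            if ch not in goto[state]:
                goto.append({})
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
    return goto


def compute_fail(goto):
    """Second pass: compute the fail value of every state, breadth first."""
    fail = [0] * len(goto)
    queue = deque(goto[0].values())      # depth-1 states fail to state 0
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            # Follow fail values until a state with a `ch` move is found.
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
    return fail


goto = build_goto(["he", "she", "his", "hers"])
print(compute_fail(goto))
```

The second pass visits every state once more after all patterns are stored, which is consistent with the additional construction time observed for the finite state automata version.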


Then, in light of a dynamic data structure, EXEC6A will
not fare well in construction time when compared to the other
methods, excluding, of course, the slowest programs.

TABLE V

Rankings- Time Required To
Construct Data Structure

Conclusions

It is difficult, if not impossible, to pick a single
overall "best" algorithm. But the choice does point to a
group of three: EXEC4A, EXEC6A, and EXEC4B. Not surprisingly,
all three are based on the alternate/successor linked list
and hence, limit "back-up". EXEC4A uses a straightforward
unpacked approach; EXEC4B uses a packed version of the same,
cutting space requirements by a factor of up to four or more;
and EXEC6A uses an unpacked and more complicated finite state
automata version requiring the largest data structure and
most time to construct.

From these three programs the final choice can be based
on the expected input, which answers, somewhat, the questions
presented in Chapter III. That is, if the input is to involve
a large  static  pattern  structure,  then  EXEC6A  is  the  best 
choice.  However,  if  the  pattern  structure  is  small,  or 
especially  if  it  is  dynamic,  then  the  best  choice  is  EXEC4A 
or  EXEC4B,  and  the  choice  between  them  should  be  based  on 
the  space  available  for  the  data  structure. 
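The selection rule stated in this paragraph can be written out as a small function. This Python sketch simply restates the rule; the function name `recommend` and its boolean parameters are hypothetical, and what counts as "large" or "space limited" is left to the caller.

```python
def recommend(large_pattern_set, dynamic, space_limited):
    """Restate the conclusion: which of the three best programs to use."""
    if large_pattern_set and not dynamic:
        return "EXEC6A"      # large static structure: finite state automata
    # Small or dynamic structure: alternate/successor linked list,
    # packed (EXEC4B) when storage space is tight.
    return "EXEC4B" if space_limited else "EXEC4A"


print(recommend(large_pattern_set=True, dynamic=False, space_limited=False))
```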

Equally important to identify are the worst programs.
Clearly, the simple list and even the indexed simple list
(EXEC1A, EXEC2A, EXEC1B, and EXEC2B) are unacceptable,
regardless of their relative simplicity. The binary tree
(EXEC3A, EXEC3B, and EXEC3C) may have given better results
if a better indexing method were used, but a subjective
evaluation points to an indexed linked list (not implemented)
as a probable better choice.

A "double-sided" conclusion can be made regarding those
programs designed to take advantage of the CYBER 74's large
storage word. On one hand, it has been shown that shrinking
the  size  of  the  data  structure  by  packing  does  not  adversely 
affect  the  speed  of  the  search  algorithm,  and  in  some  cases, 
the  search  can  be  faster.  On  the  other  hand,  attempting  to 
use  packed  input  and  a packed  data  structure  to  perform 
packed  (up  to  ten  character)  comparisons  did  not  fare  so  well. 
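The idea behind a packed comparison can be illustrated as follows. This Python sketch assumes a 60-bit word holding ten 6-bit characters, which matches the "up to ten character" comparisons mentioned above; the character codes come from a made-up alphabet rather than the actual CDC character set, and the names `pack`, `prefix_mask`, and `packed_match` are hypothetical.

```python
BITS_PER_CHAR = 6
CHARS_PER_WORD = 10
# Illustrative character codes only (position in this string).
ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789+-*/()=.,"


def pack(chars):
    """Pack up to ten characters, left-justified, into one 60-bit integer.
    Unused positions are filled with code 0."""
    word = 0
    for i in range(CHARS_PER_WORD):
        code = ALPHABET.index(chars[i]) if i < len(chars) else 0
        word = (word << BITS_PER_CHAR) | code
    return word


def prefix_mask(n):
    """Mask selecting the leftmost n characters of a packed word."""
    total_bits = BITS_PER_CHAR * CHARS_PER_WORD
    return ((1 << (BITS_PER_CHAR * n)) - 1) << (total_bits - BITS_PER_CHAR * n)


def packed_match(text_word, pattern_word, n):
    """Compare the first n characters of two packed words in one operation."""
    return ((text_word ^ pattern_word) & prefix_mask(n)) == 0


print(packed_match(pack("FORMAT(1X,"), pack("FORMAT"), 6))
```

A single masked comparison replaces up to ten character comparisons, but the caller must still track which word and which character offset the search has reached, and that bookkeeping is the overhead discussed in the text.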

There  are  several  reasons  why  this  packed  comparison 
approach  did  not  perform  as  was  hoped.  For  one,  the  overhead 
in  keeping  track  of  search  position  in  the  input  text  and 
within  each  pattern  overshadowed  the  benefits  given  by 
multi-character comparisons. However, these results can,
in part, be blamed on the inefficiency of using a high-level
language, Fortran. In the following recommendations chapter
a proposal  to  further  investigate  multi-character  comparisons 
will  be  presented. 
V.  Recommendations


It  is  in  the  spirit  of  resolving  the  many  questions 
raised  during  this  thesis  that  the  following  recommendations 
are  made.

1.  As  mentioned  earlier  in  Chapter  I,  one  user  of  the 
pattern  matching  algorithm  is  the  preprocessor.  Mortran  is 
one  such  example,  and  it  is  recommended  that  one  of  the 
"best"  versions  should  be  implemented  within  the  Mortran 
processor.  It  should  be  a packed  design  since  space  is 
critical.  The  probable  choice  is  EXEC4B  (packed  alternate/ 
successor  data  structure). 

2.  A possibility  for  improving  the  relative  worth  of 
an  indexed  simple  list  (EXEC2A  and  EXEC2B)  might  be  to 
investigate  better  hashing  (indexing)  techniques  rather  than 
first  character  as  was  used. 

3.  Also in this idea of a better hash technique, perhaps
a linked list data structure should be implemented, which, as
noted before, this author feels would serve as well or
possibly better than a binary tree, especially for a small
number of patterns in the structure.

4.  A fast pattern matching algorithm (Ref 2) was mentioned
[...]. It would be interesting to see how well
[...] algorithm would fare when
[...] design as
[...]
the input text with a single pattern.

5.  EXEC6A might be modified to achieve the "deterministic"
finite state automata as described by authors Aho and
Corasick (Ref 1).  At the same time a packed version might
be  considered.  This  approach  might  make  EXEC6A  look  better 
both  from  a search  time  viewpoint  and  from  the  amount  of 
storage  space  required. 

6.  Other improvements to search time should be considered.
For example, heuristic futility checks, such as
"is  there  enough  text  left  to  match  this  pattern?"  might  be 
implemented.  In  this  line  of  thought,  an  article  by  Malcolm 
Harrison  (Ref  8)  may  be  used  to  investigate  an  interesting 
process  which  uses  "hashing  signatures"  to  identify  patterns 
and  subject  strings.  In  a way,  this  hashing  technique  applies 
a probabilistic  futility  check,  i.e.,  "what  is  the  chance 
that  this  pattern  is  contained  in  this  line  of  text?" 

7.  Better advantage may be taken of the CYBER 74 hardware
through use of Compass (CDC assembly) language coding.
Perhaps  this  is  easier  said  than  done,  but  some  ideas  that 
come  to  mind  are,  one,  use  a single  register  to  hold  ten 
characters  of  input  text  during  the  search  to  eliminate 
repeated  memory  references,  and,  two,  use  a single  register 
from  which  to  unpack  a single  memory  location  rather  than 
referencing  the  same  repeatedly. 
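The heuristic futility check suggested in recommendation 6 ("is there enough text left to match this pattern?") can be sketched as below. This is an illustrative Python version over a simple list of patterns, not a proposal for any particular thesis program; the names `worth_trying` and `search_with_futility_check` are hypothetical.

```python
def worth_trying(text_len, position, pattern_len):
    """Futility check: is there enough text left to hold the pattern?"""
    return text_len - position >= pattern_len


def search_with_futility_check(text, patterns):
    """Return (position, pattern) for every match, skipping hopeless tries."""
    hits = []
    for pos in range(len(text)):
        for pat in patterns:
            if not worth_trying(len(text), pos, len(pat)):
                continue                 # skip without any character checks
            if text.startswith(pat, pos):
                hits.append((pos, pat))
    return hits


print(search_with_futility_check("GOTO 10", ["GOTO", "10", "100"]))
```

The check costs one subtraction and one comparison per attempt, so whether it pays off depends on how often patterns are tried near the end of a line of text.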

These seven recommendations are offered as a
continuation of the work begun in this thesis. In particular,
Compass coding may offer unique data structures and algorithms
for  the  CDC  machine  which  will  outperform  some  of  the  very 
sophisticated  general  purpose  approaches.  And  from  the 
beginning,  the  purpose  of  this  thesis  has  been  to  provide  a 
foundation  upon  which  such  future  investigation  may  be  built. 
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APPENDIX  B 


Numerical  Results  of  Test  Program  Executions 
Using  Data  Files  Described  in  Figure  25 
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